Skip to content

Conversation

@ybgao-nvidia
Copy link
Contributor

@ybgao-nvidia ybgao-nvidia commented Sep 5, 2025

What does this PR do ?

Help mitigate the performance overheads from clearing cache which is introduced in #926 by making clear cache behaviour optional.

Issues

This PR resolves #1036.

Usage

This PR introduces an additional option for dtensor_config. To clear cache between microbatch iterations, add:

policy:
    dtensor_cfg:
        # .....
        clear_cache_every_n_steps: 1

A warning will be displayed before a training iteration indicating that clear cache has been turned on and its potential performance overheads.

(DTensorPolicyWorker[rank=0] pid=3016383) /lustre/fs1/portfolios/coreai/users/yubog/RL/nemo_rl/models/policy/dtensor_policy_worker.py:635: UserWarning: Emptying cache every 1 microbatches
, doing so unnnecessarily would incur a large performance overhead.
(DTensorPolicyWorker[rank=0] pid=3016383)   warnings.warn(f"Emptying cache every {empty_cache_steps} microbatches, doing so unnnecessarily would incur a large performance overhead.")

Below is the speed comparison of the default SFT experiment with clear cache every 1 microbatch vs. no clear cache:
image

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: 9a9f638 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Signed-off-by: Yubo Gao <[email protected]>
@ybgao-nvidia ybgao-nvidia added the CI:L1 Run doctests, unit tests, and functional tests label Sep 5, 2025
@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: 852e253 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

1 similar comment
@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: 852e253 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@ybgao-nvidia ybgao-nvidia marked this pull request as ready for review September 5, 2025 04:49
Signed-off-by: Yubo Gao <[email protected]>
@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: cf551e5 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@ybgao-nvidia ybgao-nvidia added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 5, 2025
@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: cf551e5 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions
Copy link

github-actions bot commented Sep 5, 2025

ℹ️ File Consistency Check

Check based on commit: e931727 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions
Copy link

github-actions bot commented Sep 6, 2025

ℹ️ File Consistency Check

Check based on commit: 76c0a43 (PR #1074 from ybgao/sep4-optional-clear-cache)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@terrykong terrykong enabled auto-merge September 7, 2025 06:50
@terrykong terrykong added this pull request to the merge queue Sep 7, 2025
Merged via the queue into main with commit 46f5edd Sep 7, 2025
23 checks passed
@terrykong terrykong deleted the ybgao/sep4-optional-clear-cache branch September 7, 2025 09:50
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

policy_training's speed regression in both dtensor v1/v2 path

3 participants